Abstract

We consider the problem of constructing binary codes to recover from $k$-bit deletions with efficient
encoding/decoding, for a fixed $k$. The single deletion case is well understood, with the
Varshamov-Tenengolts-Levenshtein code from 1965 giving an asymptotically optimal construction with
$\approx 2^n/n$ codewords of length $n$, i.e., at most $\log n$ bits of redundancy. However, even for the
case of two deletions, there was no known explicit construction with redundancy less than $n^{\Omega(1)}$.

For any fixed $k$, we construct a binary code with $c_k \log n$ redundancy that can be decoded from $k$
deletions in $O_k(n \log^4 n)$ time. The coefficient $c_k$ can be taken to be $O(k^2 \log k)$, which is only
quadratically worse than the optimal, non-constructive bound of $O(k)$. We also indicate how to modify this
code to allow for a combination of up to $k$ insertions and deletions.

We also note that among linear codes capable of correcting $k$ deletions, the $(k+1)$-fold repetition
code is essentially the best possible.


1 Introduction 

A $k$-bit binary deletion code of length $N$ is some set of strings $C \subseteq \{0,1\}^N$ so that for any
distinct $c_1, c_2 \in C$, the longest common subsequence of $c_1$ and $c_2$ has length less than $N - k$.
For such a code, a codeword of $C$ can be uniquely identified from any of its subsequences of length
$N - k$, and therefore such a code enables recovery from $k$ adversarial/worst-case deletions.
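The definition can be checked mechanically on toy examples. The following is a minimal sketch, assuming codewords are given as Python strings over "0"/"1"; the function names `lcs_length` and `is_deletion_code` are illustrative and do not come from the paper.

```python
from itertools import combinations

def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def is_deletion_code(code, k):
    """True if every pair of distinct codewords has LCS shorter than N - k."""
    words = list(code)
    n = len(words[0])
    return all(lcs_length(a, b) < n - k for a, b in combinations(words, 2))
```

For instance, `{"000", "111"}` passes the check for k = 2, while `{"0011", "0101"}` fails for k = 1 because both words contain the subsequence `001`.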

In this work, we are interested in the regime when k is a fixed constant, and the block length N grows. 
Denoting by $\mathrm{del}(N,k)$ the size of the largest $k$-bit binary deletion code of length $N$, it is known that

$$a_k \frac{2^N}{N^{2k}} \le \mathrm{del}(N,k) \le A_k \frac{2^N}{N^k} \qquad (1)$$

for some constants $a_k > 0$ and $A_k < \infty$ depending only on $k$ [1].

For the special case of $k = 1$, it is known that $\mathrm{del}(N,1) = \Theta(2^N/N)$. The Varshamov-Tenengolts
code [2], defined by

$$\left\{ (x_1,\dots,x_N) \in \{0,1\}^N \;\middle|\; \sum_{i=1}^{N} i x_i \equiv 0 \pmod{N+1} \right\} \qquad (2)$$

is known to have size at least $2^N/(N+1)$, and Levenshtein [1] shows that this code is capable of correcting
a single deletion. An easy-to-read exposition of the deletion-correcting property of the VT code can be found
in the detailed survey on single-deletion-correcting codes [3].
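To make the check condition (2) concrete, here is a small sketch that enumerates the VT code for a toy length and decodes a single deletion by brute force (trying every reinsertion). The brute-force decoder is an illustrative stand-in, not Levenshtein's linear-time decoder; the names `in_vt`, `vt_code`, and `vt_decode` are hypothetical.

```python
from itertools import product

def in_vt(x):
    """Check the VT condition sum(i * x_i) = 0 mod (N + 1), with 1-based i."""
    n = len(x)
    return sum(i * bit for i, bit in enumerate(x, start=1)) % (n + 1) == 0

def vt_code(n):
    """Enumerate all length-n VT codewords by brute force (toy sizes only)."""
    return [x for x in product((0, 1), repeat=n) if in_vt(x)]

def vt_decode(y):
    """Recover the codeword from y (one bit deleted) by trying every reinsertion."""
    candidates = {
        y[:pos] + (bit,) + y[pos:]
        for pos in range(len(y) + 1)
        for bit in (0, 1)
    }
    matches = [x for x in candidates if in_vt(x)]
    if len(matches) != 1:
        raise ValueError("input is not a single-deletion subsequence of a VT codeword")
    return matches[0]
```

Because the code corrects one deletion, exactly one reinsertion lands back in the code, which is what the uniqueness check in `vt_decode` relies on.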

The bound (1) shows that the asymptotic number of redundant bits needed for correcting $k$-bit deletions
in an $N$-bit codeword is $O(k \log N)$ (i.e., one can encode $n = N - O(k \log N)$ message bits into length $N$
codewords in a manner resilient to $k$ deletions). Note that the codes underlying this result are obtained by an
exponential time greedy search; an efficient construction of $k$-bit binary deletion codes with redundancy
approaching $O(k \log N)$ was not known, except in the single-deletion case where the VT code gives a solution
with optimal $\log N + O(1)$ redundancy.

The simplest code to correct $k$ worst-case deletions is the $(k+1)$-fold repetition code, which maps $n$
message bits to $N = (k+1)n$ codeword bits, and thus has $kn$ redundant bits. A generalization of the VT
code for the case of multiple deletions was proposed in [4] and later proved to work in [5]. These codes
replace the weight $i$ given to the $i$-th codeword bit in the check constraint of the VT code (2) by a much larger
weight, which even for the $k = 2$ case is related to the Fibonacci sequence and thus grows exponentially in $i$.
Therefore, the redundancy of these codes is $\Omega(N)$ even for two deletions, and equals $c_k N$ where the constant
$c_k \to 1$ as $k$ increases. Some improvements were made for small $k$ in [6], which studied run-length limited
codes for correcting insertions/deletions, but the redundancy remained $\Omega(N)$ even for two deletions.
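As a point of comparison for the repetition code mentioned at the start of this paragraph, the sketch below shows why it survives $k$ deletions: every run in a codeword has length a multiple of $k+1$, so no run can vanish, and rounding each received run length up to the next multiple of $k+1$ restores it. The helper names are illustrative assumptions.

```python
from itertools import combinations, groupby

def rep_encode(message, k):
    """Repeat every bit of the message k + 1 times."""
    return "".join(bit * (k + 1) for bit in message)

def rep_decode(received, k):
    """Round each run length up to a multiple of k + 1, then undo the repetition."""
    decoded = []
    for bit, run in groupby(received):
        run_length = len(list(run))
        decoded.append(bit * -(-run_length // (k + 1)))  # ceiling division
    return "".join(decoded)

def all_deletions(word, k):
    """Yield every string obtained from word by deleting exactly k positions."""
    for removed in combinations(range(len(word)), k):
        dropped = set(removed)
        yield "".join(c for i, c in enumerate(word) if i not in dropped)
```

The price is the $kn$ redundant bits noted above, which is what the rest of the paper aims to avoid.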

Allowing for $O(N)$ redundancy, one can in fact efficiently correct a constant fraction of deletions, as
was shown by Schulman and Zuckerman [7]. This construction was improved and optimized recently in
[8], where it was shown that one could correct a fraction $\epsilon > 0$ of deletions with $O(\sqrt{\epsilon} N)$ redundant bits in
the encoding. One can deduce codes to correct a constant number $k$ of deletions with redundancy $O_k(\sqrt{N})$
using the methods of [8] (we will hint at this in Section 1.2).

In summary, despite being such a natural and basic problem, there were no known explicit codes with
redundancy better than $\sqrt{N}$ even to correct two deletions. Our main result, stated formally as
Theorem 1 below, gives an explicit construction with redundancy $O_k(\log N)$ for any fixed number $k$ of deletions,
along with a near-linear time decoding algorithm.

For simplicity, the above discussion focused on the problem of recovering from deletions alone. One 
might want codes to recover from a combination of deletions and insertions (i.e., errors under the edit 
distance metric). Levenshtein [1] showed that any code capable of correcting $k$ deletions is in fact also
capable of correcting any combination of a total of $k$ insertions and deletions. But this only concerns
the combinatorial property underlying correction from insertions/deletions, and does not automatically yield
an algorithm to recover from insertions/deletions based on a deletion-correcting algorithm. For our main
result, we are able to extend our construction to efficiently recover from an arbitrary combination of
worst-case insertions/deletions as long as their total number is at most $k$.

1.1 Our result 

In this work, we construct, for each fixed $k$, a binary code of block length $N$ for correcting $k$
insertions/deletions on which all relevant operations can be done in polynomial (in fact, near-linear) time and
that has $O(k^2 \log k \log N)$ redundancy. We stress that this is the first efficient construction with redundancy
smaller than $N^{\Omega(1)}$ even for the 2-bit deletion case. For simplicity of exposition, we go through the details
on how to construct an efficient deletion code, and then indicate how to modify it to turn it into an efficient
deletion/insertion code.

Theorem 1 (Main). Fix an integer $k \ge 2$. For all sufficiently large $n$, there exists a code length
$N \le n + O(k^2 \log k \log n)$, an injective encoding map $\mathrm{Enc} : \{0,1\}^n \to \{0,1\}^N$ and a decoding map
$\mathrm{Dec} : \{0,1\}^{N-k} \to \{0,1\}^n \cup \{\mathrm{Fail}\}$, both computable in $O_k(n (\log n)^4)$ time, such that for all
$s \in \{0,1\}^n$ and every subsequence $s' \in \{0,1\}^{N-k}$ obtained from $\mathrm{Enc}(s)$ by deleting $k$ bits, $\mathrm{Dec}(s') = s$.

Note that the decoding complexity in the above result has an FPT (fixed-parameter tractable) type
dependence on $k$, and a near-linear dependence on $n$.

Our encoding function in Theorem 1 is non-linear. This is inherent; in Appendix A we give a simple
proof that among linear codes capable of correcting $k$ deletions, the $(k+1)$-fold repetition code is essentially
the best possible.

1.2 Our approach 

We describe at a high level the ideas behind our construction of k-bit binary deletion codes with logarithmic 
redundancy. The difficulty with the deletion channel is that we don’t know the location of deletions, and 
thus we lose knowledge of which position a bit corresponds to. Towards identifying positions of some 
bits in spite of the deletions, we can break a codeword into blocks $a_1, a_2, \dots, a_m$ of length $b$ bits each, and
separate them by introducing dummy buffers (consisting of a long enough run of 0’s, say). If only $k$ bits are
deleted, by looking for these buffers in the received subsequence, we can identify all but $O(k)$ of the blocks
correctly (there are some details one must get right to achieve this, but these are not difficult). If the blocks
are protected against $O(k)$ errors, then we can recover the codeword. In terms of redundancy, one needs
at least $m$ bits for the buffers, and at least $\Omega(kb) \ge b$ bits to correct the errors in the blocks. As $mb = n$,
such a scheme needs at least $\Omega(\sqrt{n})$ redundant bits. Using this approach, one can in fact achieve $\approx \sqrt{kn}$
redundancy; this is implicit in [8].
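The trade-off behind this $\sqrt{n}$ barrier can be written out in one line. The following is a back-of-the-envelope sketch only: it charges $m$ bits for the buffers and roughly $kb$ bits for protecting the blocks, with all constants suppressed, and is not the exact accounting of the cited construction.

$$m + kb \;\ge\; 2\sqrt{m \cdot kb} \;=\; 2\sqrt{k \cdot mb} \;=\; 2\sqrt{kn}, \qquad \text{with equality when } m = kb, \text{ i.e. } b = \sqrt{n/k},\; m = \sqrt{kn}.$$

So under these assumptions no choice of block length $b$ brings an explicit-buffer scheme below order $\sqrt{kn}$ redundant bits, which is why the construction below avoids explicit buffers altogether.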

To get lower redundancy, our approach departs from the introduction of explicit buffers, as they use 
up too many redundant bits. Our key idea is to use patterns that occur frequently in the string themselves 
as “implicit” buffers, so we have no redundancy wasted for introducing buffers. Since an adversary could 
delete a portion of an implicit buffer thereby foiling our approach, we use multiple implicit patterns and 
form a separate “hash” for each pattern (which will protect the intervening blocks against k errors). Since 
many strings have very few short patterns (such as the all 0’s string), we first use a pattern enriching
encoding procedure to ensure that there are sufficiently many patterns. The number of implicit patterns is 
enough so that less than half of them can be corrupted by an adversary in any choice of k deletions. Then, 
we can decode the string using each pattern and take the majority vote of the resulting decodings. The final 
transmitted string bundles the pattern rich string, a hash describing the pattern enriching procedure, and 
the hash for each pattern. The two hashes are protected with a less efficient k-bit deletion code (with o(n) 
redundancy) so that we can immediately decode them. 

1.3 Deletion codes and synchronization protocols 

A related problem to correcting under edit distance is the problem of synchronizing two strings that are
nearby in edit distance, or document exchange [9]. The model here is that Alice holds a string $x \in \{0,1\}^n$
and Bob holds an arbitrary string $y$ at edit distance at most $k$ from $x$; for simplicity let us consider the
deletions-only case so that $y \in \{0,1\}^{n-k}$ is a subsequence of $x$. The existential result for deletion codes
implies that there is a short message $g(x) \in \{0,1\}^{O(k \log n)}$ that Alice can send to Bob, which together with $y$
enables him to recover $x$ (this is also a special case of a more general communication problem considered
in [10]). However, the function $g$ takes exponential time to compute. We note that if we had an efficient
algorithm to compute $g$ with output length $O(k \log n)$, then one can also get deletion codes with small redundancy by
protecting $g(x)$ with a deletion code (that is shorter and therefore easier to construct). Indeed, this is in effect
what our approach outlined above does, but only when $x$ is a pattern rich string. Our methods don’t yield
a deterministic protocol for this problem when $x$ is arbitrary, and constructing such a protocol with $n^{o(1)}$
communication remains open.

If we allow randomization, sending a random hash value $h(x)$ of $O(k \log n)$ bits will allow Bob to
correctly identify $x$ among all possible supersequences of $y$; however, this will take $n^{O(k)}$ time. Randomized
protocols that enable Bob to efficiently recover $x$ in near-linear time are known, but these require larger
hashes of size $O(k \log^2 n \log^* n)$ [11] or $O(k \log(n/k) \log n)$ [12]. Very recently, a randomized protocol with
an $O(k^2 \log n)$ bound on the number of bits transmitted was given in [13]. But the use of randomness makes
these synchronization protocols unsuitable for the application to deletion codes in the adversarial model. 

1.4 Organization 

In Section 2, we define the notation and describe some simple or well-known codes which will be used
throughout the paper. Section 3 demonstrates how to efficiently encode and decode pattern rich strings
against $k$-bit deletions using a hashing procedure. Section 4 describes how to efficiently encode any string as
a pattern rich string. Section 5 combines the results of the previous sections to prove Theorem 1. Section 6
describes how to modify the code so that it works efficiently on the $k$-bit insertion and deletion channel.
Section 7 suggests what would need to be done to improve redundancy past $O(k^2 \log k \log n)$ using our
methods. Appendix A proves that essentially the best linear $k$-bit deletion code is the $(k+1)$-fold repetition
code.

2 Preliminaries 

A subsequence in a string x is any string obtained from x by deleting one or more symbols. In contrast, a 
substring is a subsequence made of several consecutive symbols of x. 

Definition 1. Let $k$ be a positive integer. Let $\sigma_k : \{0,1\}^n \to 2^{\{0,1\}^{n-k}}$ be the function which maps a binary
string $s$ of length $n$ to the set of all subsequences of $s$ of length $n - k$. That is, $\sigma_k(s)$ is the set of all possible
outputs through the $k$-bit deletion channel.

Definition 2. Two $n$-bit strings $s_1$ and $s_2$ are $k$-confusable if and only if $\sigma_k(s_1) \cap \sigma_k(s_2) \ne \emptyset$.
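Both definitions translate directly into code for small inputs. This is a brute-force sketch, assuming strings over "0"/"1"; the names `deletion_ball` and `k_confusable` are illustrative, and the enumeration is exponential in $k$, so it is only meant for toy sizes.

```python
from itertools import combinations

def deletion_ball(s, k):
    """sigma_k(s): all subsequences of s of length len(s) - k."""
    n = len(s)
    return {"".join(s[i] for i in keep) for keep in combinations(range(n), n - k)}

def k_confusable(s1, s2, k):
    """True if s1 and s2 share an output of the k-bit deletion channel."""
    return not deletion_ball(s1, k).isdisjoint(deletion_ball(s2, k))
```

For example, `0011` and `0101` are 1-confusable, since deleting one bit from either can produce `001`.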

We now state and develop some basic ingredients that our construction builds upon. Specifically, we 
will see some simple constructions of hash functions such that knowledge of hash(x) and an arbitrary
string $y \in \sigma_k(x)$ allows one to reconstruct $x$. Our final deletion codes will use these basic hash functions,
which are either inefficient in terms of size or complexity, to build hashes that are efficient both in terms of 
size and computation time. These will then be used to build deletion codes, after protecting those hashes 
themselves with some redundancy to guard against k deletions, and then including them also as part of the 
codeword. 

We start with a hash of asymptotically optimal size which is inefficient to compute. For runtimes, we
adopt the notation $O_k(f(n))$ to denote that the runtime may depend on a hidden function of $k$.

Lemma 2. Fix an integer $k \ge 1$. There is a hash function $\mathrm{hash}_1 : \{0,1\}^n \to \{0,1\}^m$ for $m \le 2k \log n + O(1)$,
computable in $O_k(n^{2k} 2^n)$ time, such that for all $x \in \{0,1\}^n$, given $\mathrm{hash}_1(x)$ and an arbitrary $y \in \sigma_k(x)$, the
string $x$ can be recovered in $O_k(n^{2k} 2^n)$ time.
Proof. This result follows from an algorithmic modification of the methods of [1]. It is easy to see that
for any $n$-bit string $x$, $|\sigma_k(x)| \le \binom{n}{k} \le n^k$. Additionally, for any $(n-k)$-bit string $y$, the number of $n$-bit strings $s$
for which $y \in \sigma_k(s)$ is at most $2^k \binom{n}{k} \le 2n^k$. Thus, any $n$-bit string $x$ is confusable with at most $2n^{2k}$ other
strings. Thus, there exists a deterministic procedure in $O_k(n^{2k} 2^n)$ time which $(2n^{2k} + 1)$-colors these strings.
We can define $\mathrm{hash}_1(x)$ to be the color of $x$.

Given such a hash and an $(n-k)$-bit received subsequence $y \in \sigma_k(x)$, the receiver can in $O_k(n^{2k} 2^n)$ time
determine the color of all strings $s$ for which $y \in \sigma_k(s)$. By design, exactly one of these strings has the color
of the hash, so the receiver will be able to successfully decode $x$, as desired. □
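The coloring argument can be run directly for very small $n$. The sketch below is one possible instantiation, assuming a plain greedy coloring in lexicographic order (the proof only asserts that some deterministic procedure exists, so this ordering is an assumption); `build_hash1` and `recover` are hypothetical names, and the running time is exponential in $n$, as in the lemma.

```python
from itertools import combinations, product

def deletion_ball(s, k):
    """sigma_k(s): all subsequences of s of length len(s) - k."""
    n = len(s)
    return {"".join(s[i] for i in keep) for keep in combinations(range(n), n - k)}

def build_hash1(n, k):
    """Greedily color the k-confusability graph on {0,1}^n; the color is the hash."""
    words = ["".join(bits) for bits in product("01", repeat=n)]
    balls = {w: deletion_ball(w, k) for w in words}
    color = {}
    for w in words:
        # Colors already taken by previously colored strings confusable with w.
        taken = {c for u, c in color.items() if not balls[w].isdisjoint(balls[u])}
        color[w] = next(c for c in range(len(words)) if c not in taken)
    return color

def recover(y, hash_value, color, k):
    """Return the unique x with y in sigma_k(x) and color[x] == hash_value."""
    matches = [x for x, c in color.items()
               if c == hash_value and y in deletion_ball(x, k)]
    if len(matches) != 1:
        raise ValueError("hash does not match exactly one supersequence of y")
    return matches[0]
```

Any two distinct strings that both contain $y$ in their deletion balls are confusable and therefore receive different colors, which is why `recover` finds exactly one match.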

We now modify the above result to obtain a larger hash that is however faster to compute (and also 
allows recovery from deletions in similar time). 

Lemma 3. Fix an integer $k \ge 1$. There is a hash function $\mathrm{hash}_2 : \{0,1\}^n \to \{0,1\}^m$ for $m \approx 2kn \log\log n / \log n$,
computable in $O_k(n^2 (\log n)^{2k})$ time, such that for all $s \in \{0,1\}^n$, given $\mathrm{hash}_2(s)$ and an arbitrary $y \in \sigma_k(s)$,
the string $s$ can be recovered in $O_k(n^2 (\log n)^{2k})$ time.

Proof. We describe how to compute $\mathrm{hash}_2(s)$ for an input $s \in \{0,1\}^n$. Break up the string into
consecutive substrings $s_1, \dots, s_{n'}$ of length $\lceil \log n \rceil$, except possibly for $s_{n'}$ which is of length at most $\lceil \log n \rceil$. For
each of these strings, by Lemma 2 we can compute in $O_k(n (\log n)^{2k})$ time the string $\mathrm{hash}_1(s_i)$ of length
$\approx 2k \log\log n$. Concatenating each of these hashes, we obtain a hash of length $\approx 2kn \log\log n / \log n$ which
takes $O_k(n^2 (\log n)^{2k})$ time to compute. The decoder can recover the string $s$ from a received subsequence
$s' \in \sigma_k(s)$ in $O_k(n^2 (\log n)^{2k})$ time by using the following procedure. For each $i \in \{1, \dots, n'\}$, if $j_i$ and $j'_i$ are
the starting and ending positions of $s_i$ in $s$, then the substring between positions $j_i$ and $j'_i - k$ in $s'$ must be a
subsequence of $s_i$. Thus, applying the decoder described in Lemma 2, we can in $O_k(n (\log n)^{2k})$ time recover $s_i$.
Thus, we can recover $s$ in $O_k(n^2 (\log n)^{2k})$ time, as desired. □

We will also be using Reed-Solomon codes to correct $k$ symbol errors. For our purposes, it will be
convenient to use a systematic version of Reed-Solomon codes, stated below. The claimed runtime follows
from near-linear time implementations of unique decoding algorithms for Reed-Solomon codes; see for
example [14].

Lemma 4. Let $k < n$ be positive integers, and $q$ be a power of two satisfying $n + 2k \le q \le O(n)$. Then
there exists a map $\mathrm{RS} : \mathbb{F}_q^n \to \mathbb{F}_q^{2k}$, computable in $O_k(n (\log n)^4)$ time, such that the set $\{ (x, \mathrm{RS}(x)) \mid x \in \mathbb{F}_q^n \}$
is an error-correcting code that can correct $k$ errors in $O_k(n (\log n)^4)$ time. In particular, given $\mathrm{RS}(x)$ and
an arbitrary $z$ at Hamming distance at most $k$ from $x$, one can compute $x$ in $O_k(n (\log n)^4)$ time.


3 Deletion-correcting hash for mixed strings 

In this section, we will construct a short, efficiently computable hash that enables recovery of a string $x$
from $k$ deletions, when $x$ is typical in the sense that each short pattern occurs frequently in $x$ (we call such
strings mixed).
3.1 Pattern-rich strings


We will use n for the length of the (mixed) string to be hashed, and as always k will be the number of 
deletions we target to correct. The following parameters will be used throughout: 

$$d = \lfloor 20000 k (\log k)^2 \log n \rfloor \quad \text{and} \quad m = \lceil \log k + \log\log(k+1) + 5 \rceil . \qquad (3)$$

It is easy to see that the choice of $m$ satisfies

$$2^m > 2k(2m - 1) . \qquad (4)$$

(Indeed, we have

$$2^m \ge 32k \log(k+1) > 2k(15 \log(k+1)) > 2k(2 \log k + 2 \log\log(k+1) + 11) > 2k(2m - 1) .)$$
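Inequality (4) is easy to sanity-check numerically. The sketch below assumes base-2 logarithms throughout (consistent with the $2^m$ on the left-hand side) and uses the hypothetical helper name `pattern_length` for the parameter $m$ from (3).

```python
import math

def pattern_length(k):
    """m = ceil(log k + log log (k + 1) + 5), with logarithms taken base 2."""
    return math.ceil(math.log2(k) + math.log2(math.log2(k + 1)) + 5)

def satisfies_inequality_4(k):
    """Check 2^m > 2k(2m - 1) for the m chosen in (3)."""
    m = pattern_length(k)
    return 2 ** m > 2 * k * (2 * m - 1)
```

This gives $m = 7$ for $k = 2$, where $2^7 = 128$ comfortably exceeds $2 \cdot 2 \cdot 13 = 52$.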

We now give the precise definition of mixed strings. 

Definition 3. Let $p$ and $s$ be binary strings of length $m$ and $n$, respectively, such that $m < n$. Define a $p$-split
point of $s$ to be an index $i$ such that $p = s_i s_{i+1} \cdots s_{i+m-1}$.

Definition 4. We say that a string $s \in \{0,1\}^n$ is $k$-mixed if for every $p \in \{0,1\}^m$, every substring of $s$ of
length $d$ contains a $p$-split point. Let $\mathcal{M}_n$ be the set of $k$-mixed strings of length $n$.
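A brute-force check of these two definitions might look as follows. This is a sketch under two assumptions: indices are 0-based, and "contains a $p$-split point" is read as "contains a full occurrence of $p$". The parameters `m` and `d` are passed explicitly so that toy values can be used instead of those in (3).

```python
from itertools import product

def split_points(s, p):
    """All indices i (0-based) at which the pattern p starts in s."""
    m = len(p)
    return [i for i in range(len(s) - m + 1) if s[i:i + m] == p]

def is_mixed(s, m, d):
    """True if every length-d window of s contains every length-m pattern."""
    patterns = {"".join(bits) for bits in product("01", repeat=m)}
    for start in range(len(s) - d + 1):
        window = s[start:start + d]
        seen = {window[i:i + m] for i in range(d - m + 1)}
        if seen != patterns:
            return False
    return True
```

With `m = 2` and `d = 5`, the periodic string `0011001100` is mixed, whereas the all-zeros string is not, matching the intuition that highly regular strings lack patterns.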

3.2 Hashing of Mixed Strings 

The following is our formal result on a short hash for recovering mixed strings from k deletions. 

Theorem 5. Fix an integer $k \ge 2$. Then for all large enough $n$, there exists $b = O(k^2 \log k \log n)$ and a
hash function $H_{\mathrm{mixed}} : \mathcal{M}_n \to \{0,1\}^b$ and a deletion correction function $G_{\mathrm{mixed}} : \{0,1\}^{n-k} \times \{0,1\}^b \to
\{0,1\}^n \cup \{\mathrm{Fail}\}$, both computable in $O_k(n (\log n)^4)$ time, such that for any $k$-mixed $s \in \{0,1\}^n$, and any
$s' \in \sigma_k(s)$, we have $G_{\mathrm{mixed}}(s', H_{\mathrm{mixed}}(s)) = s$.

Definition 5. If $s \in \{0,1\}^n$, $s' \in \sigma_k(s)$, and $p \in \{0,1\}^m$, we say that $s'$ is $p$-preserving with respect to $s$ if
there are some $1 \le i_1 < \dots < i_k \le n$ such that $s'$ is obtained from $s$ by deleting $s_{i_1}, \dots, s_{i_k}$ and so that:

1. no substring of $s$ equal to $p$ contains any of the bits at positions $i_j$;

2. $s$ and $s'$ have an equal number of instances of substrings equal to $p$.

Intuitively, s' is p-preserving with respect to s if we can obtain s' from s by deleting k bits without 
destroying or creating any instances of the pattern p. 

We first prove the following lemma. 

Lemma 6. Fix an integer $k \ge 2$. Then for sufficiently large $n$, there exists a hash function
$h_{\mathrm{pattern}} : \mathcal{M}_n \times \{0,1\}^m \to \{0,1\}^{2k(\lceil \log n \rceil + 1)}$ and deletion correction function
$g_{\mathrm{pattern}} : \{0,1\}^{n-k} \times \{0,1\}^{2k(\lceil \log n \rceil + 1)} \times \{0,1\}^m \to \{0,1\}^n \cup \{\mathrm{Fail}\}$,

both computable in $O_k(n (\log n)^4)$ time, such that for every pattern $p \in \{0,1\}^m$, every $k$-mixed $s \in \{0,1\}^n$,
and an arbitrary $s' \in \sigma_k(s)$ that is $p$-preserving with respect to $s$, one has

$$g_{\mathrm{pattern}}(s', h_{\mathrm{pattern}}(s, p), p) = s .$$
Proof. We first define the hash function $h_{\mathrm{pattern}}$ as follows.

Assume we are given a mixed string $r \in \mathcal{M}_n$ and a pattern $p \in \{0,1\}^m$. Let $a_1, \dots, a_u$ be the $p$-split points
of $r$. Then we let strings $w_0, \dots, w_u$ be defined by $w_0 = r_0 \cdots r_{a_1 - 1}$, $w_u = r_{a_u} \cdots r_{n-1}$, and for $1 \le i \le u - 1$,
$w_i = r_{a_i} \cdots r_{a_{i+1} - 1}$. Thus $\{w_j\}$ are the strings that $r$ is broken into by splitting it at the split points. By the
definition of a mixed string, each $w_j$ has length at most $d$ (as defined in (3)).

We let $\ell_j$ be the length of $w_j$, let $v_j$ be $w_j$ padded to length $d$ by leading 0’s, and let $y_j = \mathrm{hash}_2(v_j)$ as
defined in Lemma 3, with the binary representation of the length $\ell_j$ appended. We can compute $y_j$ in time
$O_k((\log n)^2 (\log\log n)^{2k})$ and $y_j$ has length $v$ satisfying



(5) 


for large enough n. 

Let $x_j$ be the number whose binary representation is $y_j$. Then based on the length of $y_j$, we have that
$x_j < n$. Let $q$ be the smallest power of 2 that is at least $n + 2k$. We then apply Lemma 4 to $x = (x_1, \dots, x_n) \in
\mathbb{F}_q^n$ (with all those that are not defined being assigned value 0) to obtain $(y_1, \dots, y_{2k}) = \mathrm{RS}(x) \in \mathbb{F}_q^{2k}$
(these check symbols are distinct from the block hashes $y_j$ above). For $1 \le j \le 2k$, let $S_j$ be the binary
representation of the $j$-th check symbol, padded with leading 0’s so that its length is $\lceil \log n \rceil + 1$.

Finally, we define the hash value

$$h_{\mathrm{pattern}}(r, p) = S_1 \cdots S_{2k} .$$

Clearly, the length of $h_{\mathrm{pattern}}(r, p)$ equals $2k(\lceil \log n \rceil + 1)$.

To compute $g_{\mathrm{pattern}}(s', h, p)$, where $s'$ is a subsequence of $s$ that is $p$-preserving with respect to $s$,
we split $h$ into $2k$ equal-length blocks, calling them $S_1, \dots, S_{2k}$. We compute $(x'_1, \dots, x'_n)$ from $s'$ in the same
way that we computed $(x_1, \dots, x_n)$ from $s$ when defining $h_{\mathrm{pattern}}(s, p)$. Now, assuming $h = h_{\mathrm{pattern}}(s, p)$,
there are at most $k$ values of $j$ such that $x'_j \ne x_j$, since there are at most $k$ deletions. We can use Lemma 4
and $S_1, \dots, S_{2k}$ to correct these $k$ errors. From a corrected value of $x_j$, we can obtain the value of $v_j$ and
$\ell_j$. Since $\ell_j$ is the length of $w_j$, we can use it to remove the proper number of leading zeroes from $v_j$
and obtain $w_j$. Thus we can restore the original $s$ in $O_k(n (\log n)^4 + n (\log n)^2 (\log\log n)^{2k})$ time. Since
$(\log\log n)^{2k} \le O_k(1) + O(\log n)$,¹ the overall decoding time is $O_k(n (\log n)^4)$. □

With the above lemma in place, we are now ready to prove the main theorem of this section. 

Proof (of Theorem 5). Given a mixed string $s \in \mathcal{M}_n$, the hash $H_{\mathrm{mixed}}(s)$ is computed by computing $h_{\mathrm{pattern}}(s, p)$
from Lemma 6 for each pattern $p \in \{0,1\}^m$ and concatenating those hashes in order of increasing $p$. For the
decoding, to compute $G_{\mathrm{mixed}}(s', H_{\mathrm{mixed}}(s))$ for an $s' \in \sigma_k(s)$, we run $g_{\mathrm{pattern}}$ from Lemma 6 on each of the
$2^m$ subhashes corresponding to each $p \in \{0,1\}^m$, and then take the majority (we can perform the majority
bitwise so that it runs in $O_k(n)$ time). When deleting a bit from $s$, at most $(2m - 1)$ patterns $p$ are affected
(since at most $m$ are deleted and at most $m - 1$ are created). Thus $k$ deletions will affect at most $k(2m - 1)$
patterns of length $m$. Since $m$ was chosen such that $2^m > 2k(2m - 1)$, we have that $s'$ is $p$-preserving with
respect to $s$ for a majority of patterns $p$. Therefore, we will have $g_{\mathrm{pattern}}(s', h_{\mathrm{pattern}}(s, p), p) = s$ for a majority
of patterns $p$, and thus $G_{\mathrm{mixed}}$ reconstructs the string $s$ correctly. □
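The bitwise majority step in this proof can be sketched as follows. The representation is an assumption made for illustration: each per-pattern decoder returns either a candidate string or `None` for Fail, and a bit is set to 1 only if a strict majority of all decoders (including failed ones) vote for 1. The name `bitwise_majority` is hypothetical.

```python
def bitwise_majority(candidates, n):
    """Combine per-pattern decodings; correct whenever a strict majority equal s."""
    total = len(candidates)
    # Failed or malformed decodings cast no vote but still count toward the total.
    valid = [c for c in candidates if c is not None and len(c) == n]
    result = []
    for i in range(n):
        ones = sum(c[i] == "1" for c in valid)
        result.append("1" if 2 * ones > total else "0")
    return "".join(result)
```

If more than half of the candidates equal $s$, then at every position the bit of $s$ wins the vote, regardless of what the remaining decoders output.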

¹For instance, $(\log\log n)^{2k} \le O(2^{2k^2} + \log n)$, because either $\log\log n \le 2^k$, in which case
$(\log\log n)^{2k} \le 2^{2k^2}$, or $(\log\log n)^{2k} \le (\log\log n)^{2 \log\log\log n} \le O(\log n)$.
4 Encoding into Mixed Strings


The previous section describes how to protect mixed strings against deletions. We now turn to the question
of encoding an arbitrary input string $s \in \{0,1\}^n$ into a mixed string in $\mathcal{M}_n$.

Definition 6. Let $\mu : \{0,1\}^n \times \{0,1\}^L \to \{0,1\}^n$ be the function which takes a string $s$ of length $n$ and a
string $t$, called the template, of length $L$, and outputs the bit-wise XOR of $s$ with $t$ concatenated $\lceil n/L \rceil$ times
and truncated to a string of length $n$.
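Definition 6 amounts to XOR-ing $s$ against a periodic mask. A minimal sketch, assuming both arguments are strings over "0"/"1" (the function name `mu` mirrors the symbol $\mu$):

```python
def mu(s, t):
    """XOR s with the template t repeated and truncated to the length of s."""
    period = len(t)
    return "".join(str(int(bit) ^ int(t[i % period])) for i, bit in enumerate(s))
```

Since XOR-ing twice with the same mask cancels out, applying `mu` with the same template a second time returns the original string, which is what lets the decoder undo the enrichment once the template is known.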

We will apply the above function with the parameter choice 

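$$L = \lceil m 2^m (\log(n 2^m) + 1) \rceil \le \lceil 10000 k (\log k)^2 \log n \rceil - 1 . \qquad (6)$$

Equation (6) follows since for $k \ge 2$

$$\lceil m 2^m (\log(n 2^m) + 1) \rceil \le (\log k + \log\log(k+1) + 6)(64 k \log(k+1))(\log n + m + 1)$$

$$\le (8 \log k)(128 k \log k)(2 \log n)$$

$$\le \lceil 10000 k (\log k)^2 \log n \rceil - 1 .$$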
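For concrete parameters, the template length $L$ from (6) and the window length $d$ from (3) can be computed and compared directly. The sketch below assumes base-2 logarithms; the helper names are illustrative, and the checks cover only a sampled range of $k$ and $n$ rather than proving the inequality.

```python
import math

def pattern_length(k):
    """m from (3), with logarithms taken base 2."""
    return math.ceil(math.log2(k) + math.log2(math.log2(k + 1)) + 5)

def template_length(n, k):
    """L = ceil(m * 2^m * (log(n * 2^m) + 1)) from (6)."""
    m = pattern_length(k)
    return math.ceil(m * 2 ** m * (math.log2(n * 2 ** m) + 1))

def window_length(n, k):
    """d = floor(20000 * k * (log k)^2 * log n) from (3)."""
    return math.floor(20000 * k * math.log2(k) ** 2 * math.log2(n))
```

For $k = 2$ and $n = 1024$ this gives $L = 16128$ and $d = 400000$. The template is longer than the message at this size, which is a reminder that the guarantees are asymptotic in $n$.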

Notice that for all $s \in \{0,1\}^n$ and $t \in \{0,1\}^L$, $\mu(\mu(s,t),t) = s$. Notice also that $\mu$ is computable in
$O(n)$ time. It is not hard to see that for any $s \in \{0,1\}^n$, the string $\mu(s,t)$ for a random template $t \in \{0,1\}^L$
will be $k$-mixed with high probability. We now show how to find one such template $t$ that is suitable for $s$,
deterministically in near-linear time.

Lemma 7. There exists a function $T : \{0,1\}^n \to \{0,1\}^L$ such that for all $s \in \{0,1\}^n$, $\mu(s, T(s)) \in \mathcal{M}_n$. Also,
$T$ is computable in $O(k^3 (\log k)^3 n \log n) = O_k(n \log n)$ time.

Proof. For a given string $s$ and template $t$, we say that a pair $(i, p) \in \{0, 1, \dots, \lfloor n/L \rfloor - 1\} \times \{0,1\}^m$ is an
obstruction to $r = \mu(s,t)$ being mixed if the substring $r_{iL+1} \cdots r_{iL+L}$ of $r$ does not include the pattern $p$ as a
substring.

We will choose $T(s)$ algorithmically. For $0 \le j \le \lfloor L/m \rfloor - 1$, at the beginning of step $j$ we will
have $b_j$ potential obstructions. Clearly, $b_0 = \lfloor n/L \rfloor 2^m$. At step $j$, we will specify the values of $t_{jm+1} \cdots t_{jm+m}$.
If we chose these bits randomly, then $E[b_{j+1}] = (1 - 2^{-m}) b_j$. Thus we can check all of the $2^m$ possibilities
and find some way to specify the bits so that $b_{j+1} \le (1 - 2^{-m}) b_j$. This will then give us that

$$b_{\lfloor L/m \rfloor} \le (1 - 2^{-m})^{\lfloor L/m \rfloor} (\lfloor n/L \rfloor 2^m) \le e^{-2^{-m} \lfloor L/m \rfloor} n 2^m < 1$$

by our choice of $L$ in (6).

Thus we can find a template $T(s)$ with no obstructions in $\lfloor L/m \rfloor$ steps. The definition of an obstruction
tells us that then every substring of $r = \mu(s, T(s))$ of length $2L \le \lfloor 20000 k (\log k)^2 \log n \rfloor = d$ contains every
pattern of length $m$, so the string $r \in \mathcal{M}_n$.

Let us estimate the time complexity of finding $T(s)$. Fix a step $j$, $0 \le j < \lfloor L/m \rfloor$. Going over all
possibilities of $t_{jm+1} \cdots t_{jm+m}$ takes $2^m$ time, and for each, estimating the reduction in the number of potential
obstructions takes $O(n 2^m)$ time. The total runtime is thus $O(n 4^m L/m) = O(n 8^m \log(n 2^m)) = O(k^3 (\log k)^3 n \log n)$.

□ 


5 The Encoding/Decoding Scheme: Proof of Theorem 1


Combining the results from Sections 3 and 4, we can now construct the encoding/decoding functions Enc
and Dec which satisfy Theorem 1. But first, as a warm-up, we consider a simple way to obtain such maps
for codewords of slightly larger length $N \le n + O(k^3 \log k \log n)$ (i.e., the redundancy has a cubic rather than
quadratic dependence on $k$). Let $s$ be the message string we seek to encode. First compute $t = T(s)$ and
$r = \mu(s,t)$ as per Lemma 7. Then, define the encoding of $s$ as

$$\mathrm{Enc}(s) = \left( r, \; \mathrm{rep}_{k+1}(t), \; \mathrm{rep}_{k+1}(H_{\mathrm{mixed}}(r)) \right),$$

where $H_{\mathrm{mixed}}(\cdot)$ is the deletion-correcting hash function for mixed strings from Theorem 5 and $\mathrm{rep}_{k+1}$
is the $(k+1)$-fold repetition code which repeats each bit $(k+1)$ times. Since $H_{\mathrm{mixed}}(r)$ is the most intensive
computation, $\mathrm{Enc}(s)$ can be computed in $O_k(n (\log n)^4)$ time and its length is

$$n + (k+1)L + (k+1)|H_{\mathrm{mixed}}(r)| \le n + (k+1) \, O(k (\log k)^2 \log n) + (k+1) \, O(k^2 \log k \log n)$$

$$\le n + O(k^3 \log k \log n).$$

We now describe the efficient decoding function Dec. Suppose we receive a subsequence $s' \in \sigma_k(\mathrm{Enc}(s))$.
First, we can easily decode $t$ and $H_{\mathrm{mixed}}(r)$, since we know the lengths of $t$ and $H_{\mathrm{mixed}}(r)$ beforehand. Also
the first $n - k$ symbols of $s'$ yield a subsequence $r'$ of $r$ of length $n - k$. Then, as shown in Theorem 5,
$r = G_{\mathrm{mixed}}(r', H_{\mathrm{mixed}}(r))$ and this can be computed in $O_k(n (\log n)^4)$ time. Finally, we can compute $s = \mu(r,t)$,
so we have successfully decoded the message $s$ from $s'$, as desired.

Now, we demonstrate how to obtain the improved encoding length of $n + O(k^2 \log k \log n)$. The idea is
to use Lemma 3 to protect $t$ and $H_{\mathrm{mixed}}(r)$ with less redundancy than the naive $(k+1)$-fold repetition code.

Proof (of Theorem 1). Consider the slightly modified encoding

$$\mathrm{Enc}(s) = \left( r, \; t, \; \mathrm{rep}_{k+1}(\mathrm{hash}_2(t)), \; H_{\mathrm{mixed}}(r), \; \mathrm{rep}_{k+1}(\mathrm{hash}_2(H_{\mathrm{mixed}}(r))) \right) . \qquad (7)$$

The resulting codeword can be verified to have length $n + O(k^2 (\log k)(\log n))$ for large enough $n$: the point is
that $\mathrm{hash}_2(\cdot)$ applied to $t$ and $H_{\mathrm{mixed}}(r)$, which are $O_k(\log n)$ long strings, will result in strings of length
$o_k(\log n)$, so we can afford to encode them by the redundancy $(k+1)$ repetition code, without affecting the
dominant $O_k(\log n)$ term in the overall redundancy.

Since we know beforehand the starting and ending positions of each of the five segments in the
codeword (7), we can in $O(n)$ time recover subsequences of $r$, $t$, $\mathrm{rep}_{k+1}(\mathrm{hash}_2(t))$, $H_{\mathrm{mixed}}(r)$, and
$\mathrm{rep}_{k+1}(\mathrm{hash}_2(H_{\mathrm{mixed}}(r)))$ with at most $k$ deletions in each. By decoding the repetition codes, we can recover
$\mathrm{hash}_2(t)$ and $\mathrm{hash}_2(H_{\mathrm{mixed}}(r))$ in $O(n)$ time. Then, using the algorithm described in Lemma 3, we can recover
$t$ and $H_{\mathrm{mixed}}(r)$ in $O_k(n'^2 (\log n')^{2k})$ time where $n' = \max(L, |H_{\mathrm{mixed}}(r)|) = O(k^2 \log k \log n)$. Once $t$ and
$H_{\mathrm{mixed}}(r)$ are recovered, we can proceed as in the previous argument and decode $s$ in $O_k(n (\log n)^4)$ time,
as desired. □


6 Efficient Algorithm for Correcting Insertions and Deletions 

By a theorem of Levenshtein [1], we have that our code works not only on the $k$-bit deletion channel but
also on the $k$-bit insertion and deletion channel. The caveat though with this theorem is that the decoding
algorithm may not be as efficient. In this section, we give a high-level overview of a proof that, with
some slight modifications, the code we constructed for $k$ deletions can be efficiently decoded on the $k$-bit
insertion and deletion channel. Although the redundancy will be slightly worse, its asymptotic behavior will
remain the same.

To show that our code works, we argue that suitable modifications of each of our lemmas allow the result 
to go through. 

• Lemma 2 works for the $k$-bit insertion and deletion channel by Levenshtein’s result [1]. Since
encoding/decoding were done by brute force, the efficiency will not change by much.

• To modify Lemma 3, we show that the code which corrects $3k$ deletions can also correct $k$ insertions
and $k$ deletions nearly as efficiently. If the codeword transmitted is $s_1, \dots, s_{n'}$ where each $s_i$ is of length
at most $\lceil \log n \rceil$, then in the received word, if $s_i$ was supposed to be in positions $i_a$ to $i_b$, then positions
$i_a + k$ to $i_b - k$ must contain bits from $s_i$ except possibly for $k$ spurious insertions. Using Lemma 2
modified for insertions, we can restore $s_i$ using brute force, and thus we can restore the original string
with about the same runtime as before.

• Lemma 4 and Lemma 6 do not change because the underlying error-correcting code does not depend
on deletions or insertions.

• Theorem [5] does not change because the hash works in an error-correcting way and the number of 
affected “foiled” patterns is roughly the same. 

• Lemma 7 does not change.

• Theorem 1 needs some modifications. The encoding has the same general structure except we use
the hash functions of the modified lemmas and we use a $(3k+1)$-fold repetition code instead of a
$(k+1)$-fold repetition code. That is, our encoding is

$$\mathrm{Enc}(s) = \left( r, \; t, \; \mathrm{rep}_{3k+1}(\mathrm{hash}_2(t)), \; H_{\mathrm{mixed}}(r), \; \mathrm{rep}_{3k+1}(\mathrm{hash}_2(H_{\mathrm{mixed}}(r))) \right) .$$

In the received codeword we can identify each section with up to $k$ bits missing on each side and $k$
spurious insertions inside. In linear time we can correct the $(3k+1)$-fold repetition code by taking the
majority vote on each block of length $3k+1$. Thus, we will have $\mathrm{hash}_2(t)$ and $\mathrm{hash}_2(H_{\mathrm{mixed}}(r))$ from
which we can obtain $t$, $H_{\mathrm{mixed}}(r)$, and finally $r$ and $s$ in polynomial (in fact, near-linear) time as in the
deletions-only case.

Thus, we have exhibited an efficient code encoding $n$ bits with $O(k^2 \log k \log n)$ redundancy and which
can be corrected in near-linear time against any combination of insertions and deletions totaling $k$ in number.


7 Concluding remarks 

In this paper, we exhibit a first-order asymptotically optimal efficient code for the $k$-bit deletion channel.
Note that to improve the code length past $n + O(k^2 \log k \log n)$, we would need to modify our hash function
$H_{\mathrm{mixed}}(\cdot)$ so that either it would use shorter hashes for each particular pattern $p$ or it would require using
fewer patterns $p$. The former would require distributing the hash information between different patterns,
which may not be possible since the patterns do not synchronize with each other. An approach through the
latter route seems unlikely to improve past $n + O(k^2 \log n)$ since an adversary is able to “ruin” $k$ essentially
independent patterns because the string being transmitted is $k$-mixed.
Another interesting challenge is to give a deterministic one-way protocol with $\mathrm{poly}(k \log n)$
communication for synchronizing a string $x \in \{0,1\}^n$ with a subsequence $y \in \sigma_k(x)$ (the model discussed in Section 1.3).
The crux of our approach is such a protocol when the string $x$ is mixed, but the problem remains open when
$x$ can be an arbitrary $n$-bit string.

Another intriguing question is whether there is an extension of the Varshamov-Tenengolts (VT) code 
for the multiple deletion case, possibly by using higher degree coefficients in the check condition(s) (for 
example, perhaps one can pick the code based on $\sum_{i=1}^{n} i^a x_i$ for $a = 0, 1, 2, \dots, d$ for some small constant $d$).
Note that this would also resolve the above question about a short and efficient deterministic hash for the
synchronization problem. However, for the case of two deletions there are counterexamples for $d \le 4$, and
it might be the case that no such bounded-degree polynomial hash works even for two deletions. 
A Linear deletion codes cannot have rate better than 1/(k+1)

In this section, we generalize the work of [15] (who did the $k = 1$ case) to demonstrate that the best
asymptotic rate achievable by a linear $k$-bit binary deletion code is $1/(k+1)$. Note that this is achieved by the
$(k+1)$-fold repetition code.

Theorem 8. Let $C$ be an $n$-bit linear code for the $k$-bit deletion channel. Then, $\dim(C) \le n/(k+1) + (k+1)^2$.

Proof. Define $C^0 = C, C^1, \dots, C^k$ to be subspaces of $\mathbb{F}_2^n$ of dimension $\dim(C)$ such that

$$C^i = \{ (x_{i+1}, \dots, x_n, x_1, \dots, x_i) \mid (x_1, \dots, x_n) \in C \}.$$

Lemma 9. For all $0 \le i < j \le k$,

$$\dim(C^i \cap C^j) \le \gcd(j - i, n).$$

Proof. If $z \in C^i \cap C^j$, then there exist $x, y \in C$ such that

$$z = (x_{i+1}, \dots, x_n, x_1, \dots, x_i) = (y_{j+1}, \dots, y_n, y_1, \dots, y_j).$$

Since $j > i$, we have that $(x_{i+1}, \dots, x_{n-j+i}) = (y_{j+1}, \dots, y_n)$. Thus, when $x$ and $y$ are passed through the
$k$-bit deletion channel, they could output the same result since $k \ge j$. Thus, $x = y$. Hence, $x_\ell = x_{\ell+j-i}$ (indices
modulo $n$) for all $\ell \in \{1, \dots, n\}$. Thus, there are at most $2^{\gcd(j-i,n)}$ choices for $x$, so
$\dim(C^i \cap C^j) \le \gcd(j - i, n)$. □
From the lemma, we can see that

$$n \ge \dim\left( \sum_{i=0}^{k} C^i \right) \ge (k+1)\dim(C) - \sum_{i=0}^{k} \sum_{j=i+1}^{k} \dim(C^i \cap C^j) \ge (k+1)\dim(C) - (k+1)^3 .$$

Thus, $\dim(C) \le (k+1)^2 + n/(k+1)$, as desired.


□ 


Remark. If we let $n$ be a prime, then $\dim(C^i \cap C^j) \le \gcd(j - i, n) = 1$. Furthermore, the only possible
nontrivial intersection is $11 \dots 1$. Thus, the sum of these $k+1$ vector spaces would have to have dimension at
least $(k+1)(\dim(C) - 1) + 1$, from which we can get that $\dim(C) \le (n+k)/(k+1)$.